By determining the most common English words and phrases since the beginning of the sixteenth
century, we obtain a unique large-scale view of the evolution of written text. We find
that the most common words and phrases in any given year had a much shorter popularity 
lifespan in the sixteenth century than they had in the twentieth century. By measuring how 
their usage propagated across the years, we show that for the past two centuries, the process 
has been governed by linear preferential attachment. Along with the steady growth of the 
English lexicon, this provides an empirical explanation for the ubiquity of Zipf's law in 
language statistics and confirms that writing, although undoubtedly an expression of art 
and skill, is not immune to the same influences of self-organization that are known to regulate 
processes as diverse as the making of new friends and World Wide Web growth. 
1. INTRODUCTION

The evolution of language [1-7] is, much like the evolution of cooperation [8,9],
something that markedly distinguishes humans from other species [10,11]. While
the successful evolution of cooperation enables us to harvest the benefits of
collective efforts on an unprecedented scale, the evolution of language, along with
the set of grammatical rules [12] that allows infinitely many comprehensible
formulations [13-16], enables us to uphold a cumulative culture [17]. Were it not for
books, periodicals and other publications, we would hardly be able to continuously
build upon what is handed down by previous generations, and, consequently, the
diversity and efficiency of our products would be much lower than they are today.
Indeed, it seems that the importance of the written word for where we stand today
as a species cannot be overstated.

The availability of vast amounts of digitized data, also referred to as
'metaknowledge' or 'big data' [18], along with the recent advances in the theory and
modelling of social systems in the broadest possible sense [19,20], enables
quantitative explorations of human culture that were unimaginable even a decade
ago. From human mobility patterns [21,22], crashes in financial markets [23] and in
our economic life [24,25], the spread of infectious diseases [26-28] and malware
[29,30], the dynamics of online popularity [31] and social movements [32], to
scientific correspondence [33,34], there appear to be no limits to insightful
explorations that lift the veil on how we as humans behave, interact, communicate
and shape our very existence.

Much of what we have learned from these studies strongly supports the view that
universal laws of organization govern how nature, as well as we as a society, work
[35,36]. Languages, as comprehensively
reviewed by Solé et al. [37], and as suggested already
by Zipf [38] as well as by others before him [39], are 
certainly no exception. In fact, in many ways, it seems 
more like it is the other way around. Zipf's law is 
frequently related to the occurrence of power-law 
distributions in empirical data [40], with examples ran- 
ging from income rankings and population counts of 
cities to avalanche and forest-fire sizes [41]. Yet the 
mechanisms that may lead to the emergence of scaling 
in various systems differ. The proposal made by Zipf 
was that there is tension between the efforts of the 
speaker and the listener, and it has been shown that 
this may indeed explain the origins of scaling in the 
human language [42]. The model proposed by Yule 
[43], relying on the rich-get-richer phenomenon (see 
[44] for a review), is also frequently cited as the reason 
for the emergence of Zipf's law. With the advent of contemporary network science
[45-47], however, growth and preferential attachment, used ingeniously by
Barabási & Albert [46] to explain the emergence of scaling in random networks, have
received overwhelming attention, also in relation to the emergence of Zipf's
law in different corpora of the natural language [48,49].

Here we make use of the data that accompanied the seminal study by Michel et al.
[50], and show empirically, based on a large-scale statistical analysis of the
evolution of the usage of the most common words and phrases in the corpus of
English books over the past five centuries, that growth and preferential
attachment played a central role in determining the longevity of popularity and the
emergence of scaling in the examined corpus. The presented results support
previous theoretical studies [37] and indicate that writing, on a large scale, is
subject to the same fundamental laws of organization that determine so many other
aspects of our existence.

Figure 1. Confirmation of Zipf's law in the examined corpus. By measuring the
frequency of 1-grams in the n-grams, where n ≥ 2 (refer to key), we find that it is
inversely proportional to the rank of the 1-grams. For all n, the depicted curves
decay with a slope of -1 on a double log scale over several orders of magnitude,
thus confirming the validity of Zipf's law in the examined dataset.



2. RESULTS 

Henceforth we will, for practical reasons, refer to the words and phrases as
n-grams [50], with the meaning as described in appendix A. We begin by presenting
the results of a direct test of Zipf's law for the overall most common 1-grams in the
English corpus since the beginning of the sixteenth century. For this purpose, we
treat the n-grams for different n > 1 as individual corpora in which the frequencies
of the 1-grams are to be determined. Results presented in figure 1 confirm that,
irrespective of n, the frequency of any given 1-gram is roughly inversely
proportional to its rank. The ragged appearance of the curves is a consequence of
the rather special construction of the corpora on which this test was performed.
Yet, given the time span and the extent of the data, this is surely a very
satisfactory outcome of a test for a century-old law [39,51] on such a large scale,
validating the dataset against the hallmark statistical property of the human
language.

Turning to the evolution of popularity, we show in figure 2 how the rank of the top
100 n-grams, as determined in the years 1520, 1604, 1700, 1800 and 1900, varied
until the beginning of the next century. During the sixteenth and the seventeenth
centuries, popularity was very fleeting. Phrases that were used most frequently in
1520, for example, only intermittently succeeded in re-entering the charts in the
later years, despite the fact that we have kept track of the top 10 000 n-grams and
have started with the top 100 n-grams in each of the considered starting years. It
was not before the end of the eighteenth century that the top 100 n-grams gradually
began succeeding in transferring their start-up ranks over to the next century. The
longevity and persistence of popularity are highest during the twentieth century,
which is also the last one for which data are available, apart from the first 8
years of the twenty-first century. Comparing the different n-grams with one
another, we find that the 1-grams were always, regardless of the century
considered, more likely to retain their top rankings than the 3-grams, which in
turn outperformed the 5-grams. This, however, is an expected result, given that
single words and short phrases are obviously more likely to be reused than phrases
consisting of three, four or even five words.

Although the fleeting nature of the top rankings recorded in the sixteenth and the
seventeenth centuries is, to a degree, surely a consequence of the relatively
sparse data (only a few books per year) compared with the nineteenth and the
twentieth centuries, it nevertheless appears intriguing as it is based on the
relative yearly usage frequencies of the n-grams. Thus, at least a 'statistical'
coming of age of the written word presents itself as a viable interpretation. To
quantify it accurately, we have conducted the same analysis as presented in
figure 2 for the top 1000 n-grams for all years with data, and subsequently
calculated the average standard deviation of the resulting 1000 curves for each
starting year. Symbols presented in figure 3 depict the results of this analysis
separately for all the n-grams. A sharp transition towards a higher consistency of
the rankings occurs at the brink of the nineteenth century for all n, thus giving
the results presented in figure 2 a more accurate quantitative frame. These results
remain valid if the rankings are traced only 50 years into the future, as well as if
the same analysis is performed backwards in time, as evidenced by the thick grey
line depicting a moving average over these four scenarios as well as over all the n.
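
The stability measure described above can be illustrated with a minimal sketch. It is not the code used for figure 3: the function names, the input layout (a dictionary mapping each year to a dictionary of n-gram frequencies), the use of the population standard deviation, and the rank assigned to n-grams that drop out of the tracked list are all assumptions made here for illustration.

```python
from statistics import pstdev


def rank_tables(freq_by_year, depth):
    """Map each year to {ngram: rank}, keeping only the `depth` most frequent n-grams."""
    tables = {}
    for year, freqs in freq_by_year.items():
        # Sort by descending frequency; ties are broken alphabetically.
        ordered = sorted(freqs, key=lambda g: (-freqs[g], g))[:depth]
        tables[year] = {g: r for r, g in enumerate(ordered, start=1)}
    return tables


def rank_instability(freq_by_year, start_year, horizon=100, top=1000, depth=10000):
    """Average standard deviation of the rank curves of the `top` n-grams of start_year."""
    tables = rank_tables(freq_by_year, depth)
    years = sorted(y for y in tables if start_year <= y <= start_year + horizon)
    leaders = [g for g, r in tables[start_year].items() if r <= top]
    spreads = []
    for g in leaders:
        # Assumption: an n-gram missing from the tracked list gets rank depth + 1.
        curve = [tables[y].get(g, depth + 1) for y in years]
        spreads.append(pstdev(curve))
    return sum(spreads) / len(spreads)
```

A low value means that the n-grams leading in the starting year kept nearly constant ranks over the following `horizon` years, whereas a high value corresponds to the fleeting popularity seen in the early centuries.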

Both the validity of Zipf's law across all the data considered in this study and
the peculiar evolution of popularity of the most frequently used n-grams over the
past five centuries hint towards large-scale organization gradually emerging in
the writing of English books. Since the groundbreaking work by Barabási and Albert
on the emergence of scaling in random networks [46], growth and preferential
attachment have become synonymous with the emergence of power laws and leadership
in complex systems. Here we adopt this beautiful perspective and test whether it
holds true also for the number of occurrences of the most common words and phrases
in the English books that were published in the past five centuries. In the seminal
paper introducing culturomics [50], it was pointed out that the size of the English
lexicon has grown by 33 per cent during the twentieth century alone. As for
preferential attachment, we present in figure 4 evidence indicating that the higher
the number of occurrences of any given n-gram, the higher the probability that it
will occur even more frequently in the future. More precisely, for the past two
centuries, the points quantifying the attachment rate follow a linear dependence,
thus confirming that both growth and linear preferential attachment are indeed the
two processes governing the large-scale organization of writing. Performing the
same analysis for the preceding three centuries fails to deliver the same
conclusion, although the seed for what will eventually emerge as linear
preferential attachment is clearly inferable.

Figure 2. Evolution of popularity of the top 100 n-grams over the past five centuries. For each of the five starting years, being 1520,
1600, 1700, 1800 and 1900 from left to right (separated by dashed grey lines), the rank of the top 100 n-grams was followed until it
exceeded 10 000 or until the end of the century. From top to bottom, the panels depict results for different n, as indicated
vertically. The advent of the nineteenth century marks a turning point after which the rankings began to gain markedly in
consistency. Regardless of which century is considered, the higher the n the more fleeting the popularity. Tables listing the
top n-grams for all available years are available at http://www.matjazperc.com/ngrams.

3. DISCUSSION 

The question 'Which are the most common words and phrases of the English
language?' alone has a certain appeal, especially if one is able to use digitized
data from millions of books dating as far back as the early sixteenth century [50]
to answer it. On the other hand, writing about the evolution of a language without
considering grammar or syntax [13], or even without being sure that all the
considered words and phrases actually have a meaning, may appear prohibitive to
many outside the physics community. Yet, it is precisely this detachment from
detail and the sheer scale of the analysis that enables the observation of
universal laws that govern the large-scale organization of the written word. This
does not mean that the presented results would no longer be valid if we had made
sure to analyse only words and phrases that actually have meaning or if we had
distinguished between capitalized words, but rather that such details do not play a
decisive role in our analysis. Regardless of whether a word is an adjective or a
noun, or whether it is currently trendy or not, with the passing of the years the
mechanism of preferential attachment will make sure that the word obtains its
rightful place in the overall rankings. Together with the continuous growth of the
English lexicon, we have a blueprint for the emergence of Zipf's law that is derived
from a vast amount of empirical data and supported by theory [46]. This does not
diminish the relevance of the tension between the efforts of the speaker and the
listener [42], but adds to the importance of the analysis of 'big data' with methods
of statistical physics [52,53] and network science [48,49,54] for our understanding
of the large-scale dynamics of human language.

Figure 3. 'Statistical' coming of age of the English language.
Symbols depict results for different n (refer to key), as obtained
by calculating the average standard deviation of the rank for the
top 1000 n-grams 100 years into the future. The thick grey line is
a moving average over all the n-grams and over the analysis
going 50 and 100 years into the future as well as backwards.
There is a sharp transition to a greater maturity of the rankings
taking place at around the year 1800. Although the moving
average softens the transition, it confirms that the 'statistical'
coming of age was taking place and that the nineteenth century
was crucial in this respect.

Figure 4. Emergence of linear preferential attachment during
the past two centuries. Based on the preceding evolution of
popularity, two time periods were considered separately, as
indicated in the figure legend. While preferential attachment
appears to have been in place already during the 1520-1800
period, large deviations from the linear dependence (the
goodness-of-fit is ~0.05) hint towards inconsistencies that may
have resulted in heavily fluctuating rankings. The same analysis
for the nineteenth and the twentieth centuries provides
much more conclusive results. For all n the data fall nicely
onto straight lines (the goodness-of-fit is ~0.8), thus indicating
that continuous growth and linear preferential attachment
have shaped the large-scale organization of the writing of
English books over the past two centuries. Results for those
n-grams that are not depicted are qualitatively identical for
both periods of time.



The allure of universal laws that might describe the workings of our society is
great [35]. Observing Zipf's law [38], or more generally a power-law distribution
[41], in a dataset is an indication that some form of large-scale self-organization
might be taking place in the examined system. Implying that initial advantages are
often self-amplifying and tend to snowball over time, preferential attachment,
known also as the rich-get-richer phenomenon [43], the 'Matthew effect' [55], or the
cumulative advantage [56], has been confirmed empirically in the accumulation of
citations [57] and scientific collaborators [58,59], in the growth of the World Wide
Web [36], and in the longevity of one's career [60]. Examples based solely on
theoretical arguments, however, are far more numerous and much easier to come by.
Empirical validations of preferential attachment require large amounts of data
with time stamps included. It is the increasing availability of such datasets that
appears to fuel progress in fields ranging from cell biology to software design
[61], and as this study shows, it helps reveal why the overall rankings of the most
common English words and phrases are unlikely to change in the near future, as well
as why Zipf's law emerges in written text.

This research was supported by the Slovenian Research
Agency (grant no. J1-4055).

APPENDIX A. METHODS 
A.1. Raw data

The seminal study by Michel et al. [50] was accompanied by the release of a vast
amount of data composed of metrics derived from approximately 4 per cent of books
ever published. Raw data, along with usage instructions, are available at
http://books.google.com/ngrams/datasets as counts of n-grams that appeared in a
given corpus of books published in each year. An n-gram is made up of a series of n
1-grams, and a 1-gram is a string of characters uninterrupted by a space. Although
we have excluded 1-grams that are obviously not words (for example, those
containing characters outside the range of the ASCII table) from the analysis, some
(mostly typos) might have nevertheless found their way into the top rankings. The
latter were compiled by recursively scanning all the files from the English corpus
associated with a given n in the search for those n-grams that had the highest
usage frequencies in any given year. Tables listing the top 100, top 1000 and top
10 000 n-grams for all available years since 1520 inclusive, along with their
yearly usage frequencies and direct links to the Google Books Ngram Viewer, are
available at http://www.matjazperc.com/ngrams.
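
As an illustration of how such yearly top lists could be compiled, the following is a minimal sketch rather than the procedure actually used. It assumes that each raw record is a tab-separated line whose first three fields are the n-gram, the year and the number of occurrences; the function names are hypothetical. Within a single year, ranking by absolute counts gives the same order as ranking by relative usage frequency, because the normalizing total is the same for all n-grams of that year.

```python
import heapq
from collections import defaultdict


def parse_line(line):
    """Split one raw record; assumed layout: ngram, year, count, further columns."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def top_ngrams_per_year(lines, k):
    """Return {year: [(ngram, count), ...]} with the k most frequent n-grams per year."""
    counts = defaultdict(dict)
    for line in lines:
        ngram, year, matches = parse_line(line)
        # Skip n-grams with characters outside the ASCII table, as described above.
        if not ngram.isascii():
            continue
        counts[year][ngram] = counts[year].get(ngram, 0) + matches
    return {
        year: heapq.nlargest(k, per_year.items(), key=lambda item: (item[1], item[0]))
        for year, per_year in counts.items()
    }
```

Because `lines` can be any iterable of strings, the same function can be fed from an open file object, so that the files associated with a given n are scanned without loading them fully into memory first.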

A.2. Zipf's law

Taking the top 10 000 n-grams for all available years as the basis, we have
determined the number of unique n-grams through the centuries and ranked them
according to the total number of occurrences in the whole corpus during all the
years. In this way, we have obtained a list of 148 557 unique 1-grams, 291 661 unique
2-grams, 482 503 unique 3-grams, 742 636 unique 4-grams and 979 225 unique 5-grams.
This dataset was used for testing Zipf's law by searching for the overall top ranked
1-grams in all the other n-grams (n > 1) and recording their frequency of
occurrence. For example, the 1-gram 'the' appeared in 22 826 of the 291 661
2-grams, hence its frequency is approximately 7.8 per cent. By plotting the
frequency so obtained against the rank of the 1-grams for n = 2, 3, 4, 5 on a double
log scale (figure 1), we observe four inversely proportional curves, thus
confirming Zipf's law in the constructed dataset.
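
The frequency measurement described above can be sketched as follows. This is only an illustration under stated assumptions: n-grams are taken to be strings of 1-grams separated by single spaces, each n-gram is counted once per 1-gram it contains, and the helper names are hypothetical. The second function estimates the slope on a double log scale by ordinary least squares, which is one possible way of checking the inverse proportionality and not necessarily the one used for figure 1.

```python
import math


def containment_frequency(one_grams, n_grams):
    """Fraction of the n-grams that contain each of the given 1-grams."""
    hits = dict.fromkeys(one_grams, 0)
    for phrase in n_grams:
        # A set ensures that a phrase is counted once even if the word repeats.
        for token in set(phrase.split(" ")):
            if token in hits:
                hits[token] += 1
    total = len(n_grams)
    return {g: hits[g] / total for g in one_grams}


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) versus log(rank).

    `frequencies` must be ordered by the rank of the 1-grams, starting at rank 1.
    """
    points = [
        (math.log(rank), math.log(f))
        for rank, f in enumerate(frequencies, start=1)
        if f > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Worked example from the text: 'the' occurs in 22 826 of the 291 661 2-grams.
share_of_the = 22826 / 291661  # approximately 0.078, i.e. 7.8 per cent
```

For frequencies that are exactly inversely proportional to the rank, `loglog_slope` returns -1, which is the slope expected under Zipf's law.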



A.3. Attachment rate

Based on the assumption that the more frequently a given n-gram appears, the more
linked it is to other n-grams, we have determined the attachment rate following
network science [58] as follows. If an n-gram has appeared m times in the year y, and
k times in the year y + Δy, the attachment rate is a(m) = k/mΔy. Note that the
occurrences in the dataset are not cumulative. Hence the numerator is k itself
rather than the difference between k and m. Moreover, in determining the attachment
rate, we are not interested in the relative yearly usage frequencies, but rather in
the absolute number of times a given n-gram has appeared in the corpus in any given
year. Thus, m and k are not normalized by the total word counts per year. We have
determined a(m) based on the propagation of the top 100 n-grams between 1520-1800
and 1800-2008 with a yearly resolution. Missing years were bridged by adjusting Δy
accordingly. For the final display of the attachment rate in figure 4 and the linear
fitting, we have averaged a(m) over approximately 200 non-overlapping segments in m.
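
A minimal sketch of this procedure is given below. It is not the original implementation and rests on several assumptions: the expression for the attachment rate is read as k divided by the product of m and Δy, consecutive available years are paired so that missing years simply enlarge Δy, the segments in m are formed as groups containing equally many samples, and the fit is an ordinary least-squares line. All function names are hypothetical.

```python
def attachment_samples(counts_by_year):
    """Return (m, rate) pairs for one n-gram from its {year: absolute count} record."""
    years = sorted(counts_by_year)
    samples = []
    # Pair each available year with the next available one; gaps enlarge the interval.
    for y0, y1 in zip(years, years[1:]):
        m, k = counts_by_year[y0], counts_by_year[y1]
        if m > 0:
            samples.append((m, k / (m * (y1 - y0))))
    return samples


def bin_average(samples, segments):
    """Average (m, rate) pairs over non-overlapping groups of samples sorted by m."""
    ordered = sorted(samples)
    size = max(1, len(ordered) // segments)
    binned = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        mean_m = sum(m for m, _ in chunk) / len(chunk)
        mean_rate = sum(rate for _, rate in chunk) / len(chunk)
        binned.append((mean_m, mean_rate))
    return binned


def linear_fit(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In use, the samples of all tracked n-grams within one period would be pooled, passed to `bin_average` with roughly 200 segments, and the binned points handed to `linear_fit`.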